-
Free, publicly-accessible full text available January 1, 2026
-
This work introduces TrialSieve, a novel framework for biomedical information extraction that enhances clinical meta-analysis and drug repurposing. By extending traditional PICO (Patient, Intervention, Comparison, Outcome) methodologies, TrialSieve incorporates hierarchical, treatment group-based graphs, enabling more comprehensive and quantitative comparisons of clinical outcomes. TrialSieve was used to annotate 1609 PubMed abstracts, yielding 170,557 annotations and 52,638 final spans across 20 unique annotation categories that capture a diverse range of biomedical entities relevant to systematic reviews and meta-analyses. The performance (accuracy, precision, recall, F1-score) of four natural-language processing (NLP) models (BioLinkBERT, BioBERT, KRISSBERT, PubMedBERT) and a large language model (LLM), GPT-4o, was evaluated using the human-annotated TrialSieve dataset. BioLinkBERT had the best accuracy (0.875) and recall (0.679) for biomedical entity labeling, whereas PubMedBERT had the best precision (0.614) and F1-score (0.639). Error analysis showed that NLP models trained on noisy, human-annotated data can match or, in most cases, surpass human performance. This finding highlights the feasibility of fully automating biomedical information extraction, even when relying on imperfectly annotated datasets. An annotator user study (n = 39) revealed significant (p < 0.05) gains in efficiency and human annotation accuracy with the unique TrialSieve tree-based annotation approach. In summary, TrialSieve provides a foundation to improve automated biomedical information extraction for frontend clinical research.
Free, publicly-accessible full text available May 1, 2026
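As a reminder of how the reported entity-labeling metrics relate to one another, the following is a minimal sketch that computes precision, recall, and F1 for a single label from paired gold and predicted labels. The function name and the toy labels are hypothetical and are not drawn from the TrialSieve dataset; how the abstract's scores were averaged across the 20 categories is not stated, so no averaging scheme is assumed here.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and F1 for one label, from paired label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical toy labels ("O" marks a token outside any entity).
gold = ["Drug", "O", "Outcome", "Drug", "O", "Outcome"]
pred = ["Drug", "Drug", "Outcome", "O", "O", "Outcome"]
p, r, f = precision_recall_f1(gold, pred, "Drug")
print(p, r, f)
```

Because F1 is a harmonic mean, it always lies between precision and recall, which is why one model can lead on recall while another leads on F1.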
-
Chronic myeloid leukemia (CML) is treated with tyrosine kinase inhibitors (TKIs) that target the pathological BCR-ABL1 fusion oncogene. The objective of this statistical meta-analysis was to assess the prevalence of other hematological adverse events (AEs) that occur during or after predominantly first-line treatment with TKIs. Data from seventy peer-reviewed, published studies were included in the analysis. Hematological AEs were assessed as a function of TKI drug type (dasatinib, imatinib, bosutinib, nilotinib) and CML phase (chronic, accelerated, blast). AE prevalence aggregated across all severities and phases was significantly different between TKIs (p < 0.05) for anemia: dasatinib (54.5%), bosutinib (44.0%), imatinib (32.8%), nilotinib (11.2%); neutropenia: dasatinib (51.2%), imatinib (29.8%), bosutinib (14.1%), nilotinib (14.1%); thrombocytopenia: dasatinib (62.2%), imatinib (30.4%), bosutinib (35.3%), nilotinib (22.3%). AE prevalence aggregated across all severities and TKIs was significantly (p < 0.05) different between CML phases for anemia: chronic (28.4%), accelerated (66.9%), blast (55.8%); neutropenia: chronic (26.7%), accelerated (63.8%), blast (36.4%); thrombocytopenia: chronic (33.3%), accelerated (65.6%), blast (37.9%). Odds ratios (ORs) with 95% confidence intervals were used to compare the hematological AE prevalence of each TKI with that of the most common first-line TKI therapy, imatinib. For anemia, dasatinib OR = 1.65, [1.51, 1.83]; bosutinib OR = 1.34, [1.16, 1.54]; nilotinib OR = 0.34, [0.30, 0.39]. For neutropenia, dasatinib OR = 1.72, [1.53, 1.92]; bosutinib OR = 0.47, [0.38, 0.58]; nilotinib OR = 0.47, [0.42, 0.54]. For thrombocytopenia, dasatinib OR = 2.04, [1.82, 2.30]; bosutinib OR = 1.16, [0.97, 1.39]; nilotinib OR = 0.73, [0.65, 0.82]. Nilotinib had the greatest fraction of severe (grade 3/4) hematological AEs (30%). In conclusion, the overall prevalence of hematological AEs by TKI type was: dasatinib > bosutinib > imatinib > nilotinib. Study limitations include the inability to normalize for dosage and treatment duration.
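To illustrate how an odds ratio with a 95% confidence interval is read, here is a minimal sketch using the standard Wald (log-scale) interval for a 2x2 table. The abstract does not state which interval method the authors used, so the Wald method is an assumption, and the patient counts below are hypothetical, not taken from the study. The sketch also assumes every cell of the table is nonzero.

```python
import math


def odds_ratio_ci(events_a, total_a, events_b, total_b, z=1.96):
    """Odds ratio of group A vs. reference group B with a Wald 95% CI.

    Assumes all four cells of the 2x2 table are nonzero.
    """
    a = events_a              # group A, with the adverse event
    b = total_a - events_a    # group A, without the adverse event
    c = events_b              # reference group, with the adverse event
    d = total_b - events_b    # reference group, without the adverse event
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) for a 2x2 table.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper


# Hypothetical counts: 60 of 100 patients with the AE on a comparison drug,
# 40 of 100 on the reference drug.
or_value, ci_low, ci_high = odds_ratio_ci(60, 100, 40, 100)
print(f"OR = {or_value:.2f}, [{ci_low:.2f}, {ci_high:.2f}]")
```

An interval that excludes 1 indicates a difference from the reference at the 5% level; an interval that spans 1, such as the one reported for bosutinib and thrombocytopenia, does not.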
-
Multiple studies have reported new or exacerbated persistent or resistant hypertension in patients previously infected with COVID-19. We used literature-based discovery to identify and prioritize multi-scalar explanatory biology that relates resistant hypertension to COVID-19. Cross-domain text mining of 33+ million PubMed articles within a comprehensive knowledge graph was performed using SemNet 2.0. Unsupervised rank aggregation determined which concepts were most relevant utilizing the normalized HeteSim score. A series of simulations identified concepts directly related to COVID-19 and resistant hypertension or connected via one of three renin–angiotensin–aldosterone system hub nodes (mineralocorticoid receptor, epithelial sodium channel, angiotensin I receptor). The top-ranking concepts relating COVID-19 to resistant hypertension included: cGMP-dependent protein kinase II, MAP3K1, haspin, ral guanine nucleotide exchange factor, N-(3-Oxododecanoyl)-L-homoserine lactone, aspartic endopeptidases, metabotropic glutamate receptors, choline-phosphate cytidylyltransferase, protein tyrosine phosphatase, tat genes, MAP3K10, uridine kinase, dicer enzyme, CMD1B, USP17L2, FLNA, exportin 5, somatotropin releasing hormone, beta-melanocyte stimulating hormone, pegylated leptin, beta-lipoprotein, corticotropin, growth hormone-releasing peptide 2, pro-opiomelanocortin, alpha-melanocyte stimulating hormone, prolactin, thyroid hormone, poly-beta-hydroxybutyrate depolymerase, CR 1392, BCR-ABL fusion gene, high density lipoprotein sphingomyelin, pregnancy-associated murine protein 1, recQ4 helicase, immunoglobulin heavy chain variable domain, aglycotransferrin, host cell factor C1, ATP6V0D1, imipramine demethylase, TRIM40, H3C2 gene, COL1A1+COL1A2 gene, QARS gene, VPS54, TPM2, MPST, EXOSC2, ribosomal protein S10, TAP-144, gonadotropins, human gonadotropin releasing hormone 1, beta-lipotropin, octreotide, salmon calcitonin, des-n-octanoyl ghrelin, liraglutide, gastrins. 
Concepts were mapped to six physiological themes: altered endocrine function, 23.1%; inflammation or cytokine storm, 21.3%; lipid metabolism and atherosclerosis, 17.6%; sympathetic input to blood pressure regulation, 16.7%; altered entry of COVID-19 virus, 14.8%; and unknown, 6.5%.
-
A major bottleneck preventing the extension of deep learning systems to new domains is the prohibitive cost of acquiring sufficient training labels. Alternatives such as weak supervision, active learning, and fine-tuning of pretrained models reduce this burden but require substantial human input to select a highly informative subset of instances or to curate labeling functions. REGAL (Rule-Enhanced Generative Active Learning) is an improved framework for weakly supervised text classification that performs active learning over labeling functions rather than individual instances. REGAL interactively creates high-quality labeling patterns from raw text, enabling a single annotator to accurately label an entire dataset after initialization with three keywords for each class. Experiments demonstrate that REGAL extracts up to 3 times as many high-accuracy labeling functions from text as current state-of-the-art methods for interactive weak supervision, dramatically reducing the annotation burden of writing labeling functions. Statistical analysis reveals that REGAL performs as well as or significantly better than interactive weak supervision on five of six commonly used natural language processing (NLP) baseline datasets.
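For readers unfamiliar with labeling functions, the following is a minimal sketch of keyword-based labeling functions combined by majority vote, seeded with three keywords per class as the abstract describes. It is not REGAL's method: the class names, keywords, and helper functions are hypothetical, and the active-learning loop that proposes and selects new labeling functions is omitted.

```python
from collections import Counter

ABSTAIN = None  # a labeling function abstains when its pattern does not match


def make_keyword_lf(keyword, label):
    """Build a labeling function that votes `label` when `keyword` appears."""
    def labeling_function(text):
        return label if keyword in text.lower() else ABSTAIN
    return labeling_function


# Hypothetical seed keywords, three per class.
seed_keywords = {
    "sports": ["game", "team", "score"],
    "tech": ["software", "chip", "device"],
}
labeling_functions = [
    make_keyword_lf(keyword, label)
    for label, keywords in seed_keywords.items()
    for keyword in keywords
]


def weak_label(text):
    """Majority vote over non-abstaining labeling functions; ties abstain."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [vote for vote in votes if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


print(weak_label("The team won the game"))
print(weak_label("New chip powers the device"))
```

Plain substring matching and majority voting are the simplest possible choices here; practical weak-supervision systems typically weight labeling functions by estimated accuracy rather than counting votes equally.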
-
Optimizations for Computing Relatedness in Biomedical Heterogeneous Information Networks: SemNet 2.0
Literature-based discovery (LBD) summarizes information and generates insight from large text corpora. The SemNet framework utilizes a large heterogeneous information network or “knowledge graph” of nodes and edges to compute relatedness and rank concepts pertinent to a user-specified target. SemNet provides a way to perform multi-factorial and multi-scalar analysis of complex disease etiology and therapeutic identification using the 33+ million articles in PubMed. The present work improves the efficacy and efficiency of LBD for end users by augmenting SemNet to create SemNet 2.0. A custom Python data structure replaced reliance on Neo4j to improve knowledge graph query times by several orders of magnitude. Additionally, two randomized algorithms were built to optimize the HeteSim metric calculation for computing metapath similarity. The unsupervised learning algorithm for rank aggregation (ULARA), which ranks concepts with respect to the user-specified target, was reconstructed using derived mathematical proofs of correctness and probabilistic performance guarantees for optimization. The upgraded ULARA is generalizable to other rank aggregation problems outside of SemNet. In summary, SemNet 2.0 is a comprehensive open-source software for significantly faster, more effective, and user-friendly means of automated biomedical LBD. An example case is performed to rank relationships between Alzheimer’s disease and metabolic co-morbidities.
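To give a sense of what a metapath similarity score measures, the sketch below computes an exact normalized HeteSim-style score for the simplest case, a length-2 metapath in which source and target meet at a shared middle node type. It follows the general HeteSim definition (cosine similarity between the two reachable-probability distributions), stores the graph in a plain Python dictionary, and is not SemNet 2.0's data structure or either of its randomized algorithms. The node names and edges are hypothetical.

```python
import math

# Hypothetical toy graph: each disease node maps to its linked gene nodes.
edges = {
    "disease_A": {"gene_1", "gene_2"},
    "disease_B": {"gene_2", "gene_3"},
    "disease_C": {"gene_4"},
}


def reach_distribution(node):
    """Uniform one-step transition probabilities from `node` to its neighbors."""
    neighbors = edges.get(node, set())
    if not neighbors:
        return {}
    return {neighbor: 1.0 / len(neighbors) for neighbor in neighbors}


def hetesim_length2(source, target):
    """Cosine similarity of the two one-step reach distributions.

    Covers only the metapath source -> middle <- target.
    """
    u = reach_distribution(source)
    v = reach_distribution(target)
    dot = sum(prob * v.get(node, 0.0) for node, prob in u.items())
    norm_u = math.sqrt(sum(prob * prob for prob in u.values()))
    norm_v = math.sqrt(sum(prob * prob for prob in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


score_ab = hetesim_length2("disease_A", "disease_B")
score_ac = hetesim_length2("disease_A", "disease_C")
print(score_ab, score_ac)
```

Longer metapaths require chaining transition probabilities across several edge types, which is where exact computation becomes expensive and approximation by random sampling becomes attractive.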